data pattern
Pre-Training Estimators for Structural Models: Application to Consumer Search
Wei, Yanhao 'Max', Jiang, Zhenling
We develop pre-trained estimators for structural econometric models. The estimator uses a neural net to recognize the structural model's parameters from data patterns. Once trained, the estimator can be shared and applied to different datasets at negligible cost and effort. Under sufficient training, the estimator converges to the Bayesian posterior given the data patterns. As an illustration, we construct a pre-trained estimator for a sequential search model (available at pnnehome.github.io). Estimation takes only seconds and achieves high accuracy on 12 real datasets. More broadly, pre-trained estimators can make structural models much easier to use and more accessible.
- North America > United States > California (0.14)
- North America > Mexico > Quintana Roo > Cancún (0.05)
- North America > United States > Pennsylvania (0.04)
- Retail > Online (0.46)
- Information Technology > Security & Privacy (0.46)
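The convergence claim in the abstract above can be illustrated with a standard result. This is a sketch under assumptions the abstract does not state: training pairs are simulated by drawing the parameter \theta from a prior and computing data patterns m from simulated data, and the net f is trained with squared-error loss. Under those assumptions the minimizer is the posterior mean of \theta given the patterns:

```latex
f^{*} \;=\; \arg\min_{f}\; \mathbb{E}_{\theta,\,m}\!\left[\,\lVert f(m) - \theta \rVert^{2}\,\right]
\quad\Longrightarrow\quad
f^{*}(m) \;=\; \mathbb{E}\!\left[\theta \mid m\right].
```

The symbols \theta, m, and f are notation introduced here for illustration, not taken from the paper.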
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Wang, Huanchen, Zhang, Wencheng, Wang, Zhiqiang, Lu, Zhicong, Ma, Yuxin
Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remain challenges due to a lack of in-depth comprehension of model behavior as well as the need for expertise and iterative efforts to explore data patterns. Given the achievement of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruptions on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperforming samples to guide the development of effective DA strategies. Grounded in a literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and the corresponding data slices. Unlike conventional approaches, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.
- Asia > China > Hong Kong (0.04)
- North America > United States (0.04)
DIM-SUM: Dynamic IMputation for Smart Utility Management
Hildebrant, Ryan, Bhope, Rahul, Mehrotra, Sharad, Tull, Christopher, Venkatasubramanian, Nalini
Time series imputation models have traditionally been developed using complete datasets with artificial masking patterns to simulate missing values. However, in real-world infrastructure monitoring, practitioners often encounter datasets where large amounts of data are missing and follow complex, heterogeneous patterns. We introduce DIM-SUM, a preprocessing framework for training robust imputation models that bridges the gap between artificially masked training data and real missing patterns. DIM-SUM combines pattern clustering and adaptive masking strategies with theoretical learning guarantees to handle diverse missing patterns actually observed in the data. Through extensive experiments on over 2 billion readings from California water districts, electricity datasets, and benchmarks, we demonstrate that DIM-SUM outperforms traditional methods by reaching similar accuracy with lower processing time and significantly less training data. When compared against a large pre-trained model, DIM-SUM averages 2x higher accuracy with significantly less inference time.
- North America > United States > California > Orange County > Irvine (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Energy (1.00)
- Government > Regional Government (0.46)
- Water & Waste Management > Water Management > Water Supplies & Services (0.34)
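For context on the DIM-SUM abstract above, the conventional setup it contrasts itself with (hiding values from a complete series at random and scoring an imputer on the hidden positions) can be sketched as follows. This is a minimal illustration only: the function names, the 20% missing rate, and the forward-fill baseline are assumptions made here, and the sketch does not implement DIM-SUM's pattern clustering or adaptive masking.

```python
import random


def mask_series(series, missing_rate=0.2, seed=0):
    """Hide a random subset of values (None marks a hidden value)."""
    rng = random.Random(seed)
    n_hidden = max(1, round(missing_rate * len(series)))
    hidden = set(rng.sample(range(len(series)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(series)]
    return masked, sorted(hidden)


def forward_fill(masked):
    """Baseline imputer: carry the last observed value forward."""
    filled, last = [], None
    for v in masked:
        if v is not None:
            last = v
        filled.append(last)
    # Leading gaps have no earlier value; fill them with the first observation.
    first = next((v for v in filled if v is not None), 0.0)
    return [first if v is None else v for v in filled]


def masked_mae(truth, imputed, hidden):
    """Mean absolute error measured only on the artificially hidden positions."""
    return sum(abs(truth[i] - imputed[i]) for i in hidden) / len(hidden)


# Hypothetical hourly readings with a daily cycle (illustrative data only).
series = [float(t % 24) for t in range(96)]
masked, hidden = mask_series(series)
imputed = forward_fill(masked)
error = masked_mae(series, imputed, hidden)
```

Uniform random masking like this is easy to generate but does not resemble long outages or sensor-specific gaps, which is the mismatch between artificial and real missing patterns that the abstract describes.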
SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data
Zhou, Yan, Malin, Bradley, Kantarcioglu, Murat
Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, though not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We show, both theoretically and empirically, that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but also maintains utility in downstream learning tasks.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England (0.04)
- North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
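The "contracting data patterns" mentioned in the SMOTE-DP abstract above come from the way SMOTE builds each synthetic point: it interpolates between a minority sample and one of its nearest neighbours, so new points stay inside the region spanned by existing ones. The sketch below shows only that interpolation step in plain Python. The function name, the parameters `k` and `seed`, and the toy data are assumptions made here; the differentially private generator and the way the paper combines the two are not shown.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between base and partner.
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic


# Toy minority class in the unit square (illustrative data only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote(minority, n_new=10)
```

Because every synthetic point is a convex combination of two real points, none can fall outside the bounding box of the original minority samples.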
Transferring self-supervised pre-trained models for SHM data anomaly detection with scarce labeled data
Zhou, Mingyuan, Jian, Xudong, Xia, Ye, Lai, Zhilu
Structural health monitoring (SHM) has experienced significant advancements in recent decades, accumulating massive monitoring data. Data anomalies inevitably exist in monitoring data, posing significant challenges to their effective utilization. Recently, deep learning has emerged as an efficient and effective approach for anomaly detection in bridge SHM. Despite its progress, many deep learning models require large amounts of labeled data for training. The process of labeling data, however, is labor-intensive, time-consuming, and often impractical for large-scale SHM datasets. To address these challenges, this work explores the use of self-supervised learning (SSL), an emerging paradigm that combines unsupervised pre-training and supervised fine-tuning. The SSL-based framework aims to learn from only a very small quantity of labeled data by fine-tuning, while making the best use of the vast amount of unlabeled SHM data by pre-training. Mainstream SSL methods are compared and validated on the SHM data of two in-service bridges. Comparative analysis demonstrates that SSL techniques boost data anomaly detection performance, achieving increased F1 scores compared to conventional supervised training, especially given a very limited amount of labeled data. This work demonstrates the effectiveness and superiority of SSL techniques on large-scale SHM data, providing an efficient tool for preliminary anomaly detection with scarce label information.
Discovery and Simulation of Data-Aware Business Processes
López-Pintado, Orlenys, Murashko, Serhii, Dumas, Marlon
Simulation is a common approach to predict the effect of business process changes on quantitative performance. The starting point of Business Process Simulation (BPS) is a process model enriched with simulation parameters. To cope with the typically large parameter spaces of BPS models, several methods have been proposed to automatically discover BPS models from event logs. Virtually all these approaches neglect the data perspective of business processes. Yet, the data attributes manipulated by a business process often determine which activities are performed, how many times, and when. This paper addresses this gap by introducing a data-aware BPS modeling approach and a method to discover data-aware BPS models from event logs. The BPS modeling approach supports three types of data attributes (global, case-level, and event-level) as well as deterministic and stochastic attribute update rules and data-aware branching conditions. An empirical evaluation shows that the proposed method accurately discovers the type of each data attribute and its associated update rules, and that the resulting BPS models more closely replicate the process execution control flow relative to data-unaware BPS models.
- Europe > Estonia > Tartu County > Tartu (0.05)
- Europe > Netherlands (0.04)
- Europe > Italy (0.04)
NuwaTS: a Foundation Model Mending Every Incomplete Time Series
Cheng, Jinguo, Yang, Chunwei, Cai, Wanlin, Liang, Yuxuan, Wu, Yuankai
Time series imputation plays a crucial role in various real-world systems and has been extensively explored. Models for time series imputation often require specialization, necessitating distinct designs for different domains and missing patterns. In this study, we introduce NuwaTS, a framework to repurpose Pre-trained Language Models (PLMs) for general time series imputation. Once trained, this model can be applied to imputation tasks on incomplete time series from any domain with any missing patterns. We begin by devising specific embeddings for each sub-series patch of the incomplete time series. These embeddings encapsulate information about the patch itself, the missing data patterns within the patch, and the patch's statistical characteristics. To enhance the model's adaptability to different missing patterns, we propose a contrastive learning approach to make representations of the same patch more similar across different missing patterns. By combining this contrastive loss with the missing data imputation task, we train PLMs to obtain a one-for-all imputation model. Furthermore, we utilize a plug-and-play layer-wise fine-tuning approach to train domain-specific models. Experimental results demonstrate that, by leveraging a dataset of over seventeen million time series from diverse domains, we obtain a one-for-all imputation model that outperforms existing domain-specific models across various datasets and missing patterns. Additionally, we find that NuwaTS can be generalized to other time series tasks such as forecasting. Our code is available at https://github.com/Chengyui/NuwaTS.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > California (0.04)
Causal Imputation for Counterfactual SCMs: Bridging Graphs and Latent Factor Models
Ribot, Alvaro, Squires, Chandler, Uhler, Caroline
We consider the task of causal imputation, where we aim to predict the outcomes of some set of actions across a wide range of possible contexts. As a running example, we consider predicting how different drugs affect cells from different cell types. We study the index-only setting, where the actions and contexts are categorical variables with a finite number of possible values. Even in this simple setting, a practical challenge arises, since often only a small subset of possible action-context pairs have been studied. Thus, models must extrapolate to novel action-context pairs, which can be framed as a form of matrix completion with rows indexed by actions, columns indexed by contexts, and matrix entries corresponding to outcomes. We introduce a novel SCM-based model class, where the outcome is expressed as a counterfactual, actions are expressed as interventions on an instrumental variable, and contexts are defined based on the initial state of the system. We show that, under a linearity assumption, this setup induces a latent factor model over the matrix of outcomes, with an additional fixed effect term. To perform causal prediction based on this model class, we introduce a simple extension to the Synthetic Interventions estimator (Agarwal et al., 2020). We evaluate several matrix completion approaches on the PRISM drug repurposing dataset, showing that our method outperforms all other considered matrix completion approaches.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Europe > Spain > Basque Country (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
- Health & Medicine > Therapeutic Area (0.46)
- Government > Regional Government (0.46)
Graph-based Forecasting with Missing Data through Spatiotemporal Downsampling
Marisca, Ivan, Alippi, Cesare, Bianchi, Filippo Maria
Given a set of synchronous time series, each associated with a sensor-point in space and characterized by inter-series relationships, the problem of spatiotemporal forecasting consists of predicting future observations for each point. Spatiotemporal graph neural networks achieve striking results by representing the relationships across time series as a graph. Nonetheless, most existing methods rely on the often unrealistic assumption that inputs are always available and fail to capture hidden spatiotemporal dynamics when part of the data is missing. In this work, we tackle this problem through hierarchical spatiotemporal downsampling. The input time series are progressively coarsened over time and space, obtaining a pool of representations that capture heterogeneous temporal and spatial dynamics. Conditioned on observations and missing data patterns, such representations are combined by an interpretable attention mechanism to generate the forecasts. Our approach outperforms state-of-the-art methods on synthetic and real-world benchmarks under different missing data distributions, particularly in the presence of contiguous blocks of missing values.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England (0.04)
- Asia > China (0.04)
- Information Technology > Data Science > Data Quality (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Data Science > Data Mining (0.93)
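The progressive coarsening described in the spatiotemporal downsampling abstract above can be illustrated along the time axis alone. The sketch below is a simplification under stated assumptions: it average-pools each window over observed values only, marks a window as missing when nothing in it was observed, and stacks the results into a pool of scales. The function names, the window size, and the toy series are chosen here for illustration; the paper's graph-based spatial pooling and attention mechanism are not shown.

```python
def coarsen(values, window=2):
    """Average each window over observed entries only; None marks a missing value."""
    pooled = []
    for start in range(0, len(values), window):
        observed = [v for v in values[start:start + window] if v is not None]
        # A window with no observations stays missing at the coarser scale.
        pooled.append(sum(observed) / len(observed) if observed else None)
    return pooled


def pyramid(values, levels=3, window=2):
    """Return the original series followed by progressively coarser versions."""
    scales = [list(values)]
    for _ in range(levels):
        scales.append(coarsen(scales[-1], window))
    return scales


# Toy series with a contiguous block of missing values (illustrative data only).
series = [1.0, None, 3.0, 5.0, None, None, 7.0, 9.0]
scales = pyramid(series)
# scales[1] == [1.0, 4.0, None, 8.0]
# scales[2] == [2.5, 8.0]
# scales[3] == [5.25]
```

The contiguous gap is still missing after one level of coarsening but is covered at the next, which is one intuition for why coarser representations help with blocks of missing values.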